A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction

Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research in order to generate high-quality labelled data that can be used for the development of innovative predictive methods. However, building such fully labelled, high quality bioRE data sets of adequate size for the training of state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limitations on time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and positively improve bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, are statistically performing better in terms of F1-score, accuracy and precision, than other types of AL strategies. However, in terms of recall, a diversity-based strategy, called Core-set, outperforms all strategies. AL strategies are shown to reduce the annotation need (in order to reach a performance at par with training on all data), from 6% to 38%, depending on the data set; with Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. We show through the experiments the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets leading to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.


Introduction
With the expansion of the biomedical literature, many efforts have been made, either by improving search processes or automatic identification of relevant content [1], to aid researchers and clinicians in navigating this overwhelming amount of data and generating data resources that can be used for further study [2].Research in biomedical text-mining and natural language processing has been shown to be key to supporting such biocuration, facilitating to a certain extent the population of biomedical knowledge resources.
Serving as a building block in such curation activities, biomedical Relation Extraction (bioRE) aims to detect and classify relations between biomedical entities within a text.It is often paired with the Named Entity Recognition, as it first needs to identify relevant entities [3].While early relation extraction (RE) relied on template and rule-based methods [4], traditional machine learning methods involving features such as re-occurring terms or the structure of the sentences have emerged, offering a first foray into automated text-mining approaches for such a task [5][6][7][8].New advances and state-of-the-art performances have been afterwards obtained with deep learning techniques [9]: convolutional and recurrent neural networks radically increased the accuracy for bioRE by exploiting the latent dependencies between words [10][11][12][13][14]. Additionally, graph neural networks (GNNs) that incorporated rich linguistic features, both within and between sentences [15][16][17] within the graph, provided an alternative to boost performance.Nonetheless, the current best models are adaptations of the Bidirectional Encoder Representation for Transformer (BERT) architecture [18], requiring only additional training or fine-tuning with biomedical papers to make them relevant for the biomedical domain [19][20][21][22].An overview of pre-trained transformerbased biomedical language models is reported in Kalyan et al. [23], which also highlights the challenges of the use of deep learning models, of which one is the amount of data needed for retraining.
Luo et al. recently published an overview of existing bioRE data sets, showcasing the need for high-quality, complex data sets [24].At the moment, most of the RE data sets are focused on single binary relations, i.e. relations involving two entities of specific types only, such as protein-protein interactions or chemical-disease interactions, found at the single sentence level.Challenges thus remained to train systems to find multiple RE types, RE involving more than two entities, as well as RE across multiple sentences.With BioRED, the authors partially filled that gap by creating a data set with diverse types of binary relations involving multiple entity types at the document level.While this type of data set promotes the development of more robust and accurate bioRE models, the authors have noted that one of its limitations is its size.Indeed, they could only include 600 abstracts as manual annotation is costly in time and expertise, a recurring problem for all RE data sets.However, due to the increasing use of deep learning methods for bioRE, large amounts of data are needed more often to achieve good performances.
As labelling additional samples in data sets may be difficult and substantial unlabelled data is readily available, alternative approaches are required that somehow sample optimally the unlabelled cohort data to train high-quality RE models without having to explicitly label all of the data.
Active learning (AL) is a key approach that provides a way to minimise the annotation cost [25].It is achieved by training a model with a partially labelled data set and then the AL process itself will actively select among the unlabelled samples the set of samples it considers to be the most informative.This may be achieved for example by choosing the samples for which the model is the most uncertain about their predictions (prediction-based) or by selecting the data the most different from what it has already been trained on (diversity-based).An oracle, normally human, provides the correct labelling for these samples, which are subsequently added to the training set.The cycle begins anew with the newly obtained training set to train the model until a stopping criterion, such as a minimal performance of the trained model obtained on the test set or a number of newly labelled instances, is reached (Fig 1).Using such techniques minimises the amount of data to label to reach optimal performance.AL has already been extensively studied for several text-mining tasks, such as NER [26][27][28][29][30][31][32] or text classification [33][34][35][36], both for general and clinical domains, as well as for traditional [26,28,30,32] or deep learning models [31,33,34,36,37], and it has been shown that the maximum accuracy may be reached with only a subset of the total data set.While AL has been studied for some bioRE data sets with simpler traditional machine learning models (e.g.random forests) [30,38,39], it has not yet been bench-marked with cutting-edge RE models, such as BERT-based models.Nonetheless, AL may prove to be an essential component in such data-hungry approaches.
In this work, we present a first study of the use of AL for labelling bioRE data sets, using a deep learning model (PubMedBERT) as a learner.We show that AL can reduce the need for labelling for bioRE and that strategies based on the predictions of the model allow the model to reach faster an optimal performance in terms of F1-score, accuracy and precision than by randomly labelling.We also demonstrate that a diversity-based strategy, called Core-set, outperforms all other AL strategies in terms of recall.We conclude this article by presenting some guidelines for the use of AL in real-world scenarios for labelling bioRE data sets.

Description of the data sets
In this work, we evaluate seven different bioRE data sets, covering several types of entities (i.e., gene, protein, chemicals, drug and variant) and containing relations at the sentence and As a pre-processing step, we perform entity masking, where each entity is masked with a type in the text, e.g.@GENE$ for an entity belonging to the gene type, as was done previously in similar studies using PubMedBERT [20,24,40].This ensures that the classification of the relationship relies on the context around the entities and not the entities themselves, producing a better generalisation of the model [41].When only positive instances are available for the data set, negative instances are generated by creating false relationships between entities with no existing connection in the data set.It should be noted that this process may produce an imbalanced data set biased towards the negative class, which has to be addressed in the bioRE task.
The general statistics of the investigated data sets are summarised in Table 1.Pre-processed data sets, and the scripts to import and convert data, can be found at https://github.com/oligogenic/Deep_active_learning_bioRE/tree/main/data.

Active learning selection strategies
AL is conducted either in a stream-based or pool-based scenario.The stream-based scenario considers each unlabelled instance independently for labelling, while in the pool-based scenario, all the unlabelled instances are evaluated at the same time and only the top best queries, also called active batch, are selected for labelling [25].In this work, the pool-based scenario is used as all the unlabelled instances are available at the start of the process and the data set will not change dynamically during the process, as is implied in a stream-based scenario.
As the data sets used in this work are fully annotated, the oracle is simulated, avoiding in this way any additional manual labelling [47].
The instances to be labelled are sampled according to a variety of strategies, which are classified in several categories according to how they characterise the instances as most interesting to be included in the training set: diversity-based, model-based and prediction-based.A noteworthy method not included in the above categories is the Random Sampling method.This method is often used as a baseline as it mimics the real-life settings where each instance has an equal probability of being selected while labelling and samples are not selected in a biased way.
Six strategies were selected for the current analysis, including Random Sampling as a baseline, covering the diversity-based and prediction-based categories, as well as one hybrid method combining the two categories.No model-based methods were chosen for this study as they require excessive computational resources [48].These methods were selected for their common uses as baselines in similar study of AL benchmarks in text classification and sentiment analysis [33,36].Diversity-based instance sampling.Diversity-based methods, also called density-based or data-based methods, rely only on properties of the raw data to select the instances, e.g. by choosing the most diverse set of instances.
The Core-set method [49] was selected, as it is considered a standard diversity-based method in deep AL.In this method, the selected batch of unlabelled instances is generated to be the most different from the already labelled training instances as well as the most diverse among themselves.First, the distance between all the instances is computed as the Euclidean distance between the embeddings obtained from the last layer of the transformer.Then, an unlabelled instance, the furthest from the labelled instances, is selected to be added to the labelled set.This operation is repeated until the desired number of unlabelled instances is added.The aim of this approach is to minimise the distance between the labelled set and the unlabelled set.
Prediction-based instance sampling.Prediction-based methods use the outputs of the model to score the instances.The instances with the highest scores, e.g.deemed as the most uncertain according to an uncertainty measure, are selected for labelling [25].
Three prediction-based methods using different uncertainty measures were chosen for the current study based on their frequency of use in the literature [50].
In the following equations, we denote the instance as x, the number of classes as C, ŷi is the i-th most likely class predicted for the instance x and Pðŷ i jxÞ is a probability-like predicted class distribution obtained for the instance x for the class ŷi .
First, the Least-Confident Sampling method measures the difference between the most confident prediction and 100% confidence [51].It is calculated as follows: Second, the Margin Sampling method takes the difference between the top two most confident predictions [52,53].In our case, the method simply uses the difference between the predictions for the first (ŷ 1 ) and the second (ŷ 2 ) most likely class probabilities.
Finally, the Entropy Sampling method computes the difference between all predictions, as defined by information theory [54,55].
Pðy i jxÞlogðPðy i jxÞÞ Hybrid instance sampling.As hybrid method, BatchBALD [56] was selected as it is known to perform particularly well for Deep Learning models [57].BALD (Bayesian Active Learning by Disagreement) selects points which have the highest mutual information between model predictions and parameters ω of the model.This is estimated with the following equation: Iðy i ; x; oÞ ¼ Hðy i jxÞ À E pðoÞ ½Hðy i jx; oÞ� with H being the conditional entropy of the model's prediction for the class y i , i.e. the general uncertainty of the model, and E representing the expected value of the entropy of the model prediction over the posterior of the model parameters ω, i.e. the expected uncertainty for a given sampling of the model parameters.So the score will be high when the model is highly uncertain for the prediction (high H) while having many draws of the model parameters disagreeing which is the best way to explain the prediction (low E), meaning a high score can be translated as a sample with high uncertainty and high disagreement between the model parameters to explain the prediction.
BatchBALD builds on the BALD acquisition function by computing the mutual information for a set of instances (i.e., a batch) instead of a unique instance.In that case, the expectation E can be approximated using a Monte Carlo estimator to sample parameter distribution of the model.In our experiments, 10 inference cycles are performed following the specifications in the original article [56].

PubMedBERT model
PubMedBERT is a pre-trained transformer-based biomedical language model [20].It follows the classical BERT architecture, based on a transformer, and was pre-trained using purely biomedical texts from PubMed, both abstracts and full-text articles.It is considered one of the state-of-the-art biomedical natural language processing models and has been deployed in a variety of studies in the last years [24,58,59].
For the RE task, we used a sentence classification framework with PubMedBERT.Different hyper-parameters are used for each of the data sets (see Table 2).Some of the parameters, controlling the training of the model, such as the maximum length of the input allowed, the number of training epochs and the learning rate hyper-parameters are tuned using the information in Gu et al. [20] and Lai et al. [60].The rest of the hyper-parameters of the model, such as the optimiser, were kept as in Gu et al. [20].This choice has been made as we want to study the AL process, which does not need an optimisation of the hyper-parameters.Hyperparameter optimisation can be done once the final labelled data set has been obtained and one desires the optimal performance for that data set.
The other hyper-parameters, i.e. the size of the training seed and the active batch size, controlling the active learning process, have been chosen so that after ten iterations of the AL process the whole pool is used (Table 2).

Experimental settings
A k-fold cross-validation was used (k = 5), (Algorithm 1-Line 1) and the selection strategies were repeated n = 3 times (Algorithm 1-Line 3) with different training seeds, giving a total of 15 executions for each pair of strategy and data set.Overall, 6,300 fine-tuning experiments of the PubMedBERT model were performed (  The main scripts for the benchmark used the libraries from huggingface: transformers, evaluate and accelerate [61]; as well as the DISTIL (https://github.com/decile-team/distil)library for the active learning methods.Statistical tests were conducted in R with the package scmamp and in Python with scipy.
A pseudocode for the process can be found in Algorithm 1. Performance measurements were computed after each training step of the model, by evaluating the trained model on an independent test set.It should be noted that the model is fine-tuned from the start at each iteration.The average of the results at each iteration is used for downstream analysis.
Algorithm For the performances measures, given the binary classification setting (i.e., in RE, presence/ absence of a relationship), the F1-score, precision, recall and accuracy metrics were used, as described below: where TP is True Positives, TN is True Negatives, FP is False Positives and FN represents the False Negatives.When dealing with imbalanced data sets, it is important to assess multiple performance metrics such as precision, recall, and F1-score.Accuracy can be misleading because it does not account for the disproportionate class distribution, leading to inflated scores for the majority class.Precision and recall are more informative, especially for the minority class, as they measure the model's ability to correctly identify positive instances.The F1-score provides a balanced measure of precision and recall, and is particularly useful for overall model evaluation on imbalanced data sets.Therefore, it is important to evaluate models using multiple metrics to obtain a comprehensive understanding of their performance on imbalanced data sets.

Evaluating the active learning performance
Evaluating the active learning performances is done by (1) analysing visually the learning curves of the active learning processes, where it is expected that the learning curves of AL strategies will be above the learning curve of the Random baseline, signifying that they outperform it, by (2) observing the relative difference in their performance compared to Random Sampling across the iterations, by (3) establishing the size of the subset of the data set needed to be labelled to reach a specific performance, and by (4) conducting statistical comparisons between the AL strategies.AL processes can also be observed through the distribution of classes in the labelled set across the iterations.The recommendations of Reyes et al. [62] for conducting statistical comparisons of AL strategies were followed: AL selection strategies are compared either according to the analysis of the area under the learning curve (i.e. the curve obtained by measuring the metrics at each AL iteration), or by analysing the intermediate results (AL iterations) with non-parametric ranking statistical tests.This form of analysis has also been used previously in Ein-Dor et al. [36] and He et al. [63].
Relative difference with the Random baseline.The difference between the performance of the AL strategy and the Random baseline is computed at each AL iteration, matching the fold and repeat of the performance measures.The results are reported as a box plot per data set demonstrating the distribution of the relative differences with the Random baseline for each strategy for each iteration of the AL process.
Analysis of the area under the learning curve.The area under the learning curve (AULC) computes the area below the performance curve.The higher the AULC, the better the performance of the model.
With m as the number of AL iterations, z i as the performance of the model at the i-th iteration and L i as the labelled set at the i-th iteration, the AULC of a learning strategy θ can be approximated with the trapezoidal rule [62], as The AULCs between two sampling strategies are statistically compared by computing the AULCs of each strategy for each data set and then ranking each strategy for each data set.The lower the rank, the higher the performance.We can test if a strategy performs significantly better with consistency across the data sets using a Friedman test [64], a non-parametric statistical test analysing the variance of ranks, as advised in Garcia et al. and Reyes et al. [62,65].The null hypothesis assumes that all strategies have identical performances.
The AULCs can be compared either directly with the Random baseline as control, in that case, we apply the Hommel post-hoc procedure [66] to correct the obtained p-values, or pairwise with all the possible comparisons, in that case, the Bergmann-Hommel post-hoc procedure is applied [67].
Intermediate results analysis.With the analysis of the AULC, important information derived from the intermediate results, i.e. the AL iterations, can be lost as it aggregates all the iterations into one unique value.
To obtain a value representing the difference between two strategies at a specific iteration, the cut-point scoring scheme is used [62], defined here as: with α and β being two selection strategies under comparison, and z y i;j the performance realised by the selection strategy θ at the i-th iteration for the j-th data set, resulting in a matrix of cutpoint scores.
If the dominant strategy α outperforms the strategy β, the difference in performance should be positive and increase during the AL process.Thus, this can be tested with a Page Trend test [68] where the ideal ranking for the cut-point scores is (m, m − 1, � � �, 1).The null hypothesis corresponds to the case where the ranking is randomly ordered.
A statistical test is conducted for each pair of strategies with their matrix of cut-point scores.Each cell c i,j of the resulting table contains the p-value of the statistical test between the strategies associated with row i and column j.A low p-value means that the probability that the ranking is random is also low and thus that the row strategy outperforms the column strategy across the data sets.
It should be noted that for this analysis, the first and last iterations, corresponding to the performances for the training seed and the performance with a full data set, were discarded in order to keep only the intermediate iterations where differences between sampling strategies should be observed.As the size of the active batch is around one-tenth of the total size of the pool set, some data sets had an extra iteration to reach the total size of the pool.Consequently, we discarded that extra iteration in order to have the same number of observations for all the data sets.
Performance for a subset of data set.The average of the performance reached for the full data set was computed across all folds and repeats.It is then used as a threshold to determine for each sampling strategy the percentage of the data set needed to at least reach that performance.
Class imbalance analysis.Class imbalance is an important problem in most biomedical data sets, with few positive examples.Studying the impact of the selection strategy relative to class imbalance is important in order to choose the best AL strategy according to the class distribution in the data set [33,69,70].To better understand how selection strategies work with class imbalance, class imbalance is measured at each iteration of the AL process with a modified version of the Shannon Entropy, as done in Angeli et al. [33]: where C is the total number of classes, c i is the number of instances belonging to the class i and n is the total number of instances.A balance of 0 represents a completely imbalanced data set (i.e., a data set with instances belonging to a unique class) and 1 a perfectly balanced data set (i.e., a data set where each class possesses the same number of instances).We also study the balance of the data sets by computing the fraction of positive examples in the training set.Resulting class imbalance values are averaged over each iteration per pair of data set and strategy.

AL sampling strategies outperform Random Sampling
In Fig 2, we show the analysis of the relative difference of their F1-score with the Random baseline (see Methods Section Relative difference with the Random baseline).Except for Batch-BALD, all the AL strategies outperform the Random baseline.The difference in performance decreases the closer the training set is to the full data set, which is because the difference between the subset selected by an AL strategy and the Random baseline is disappearing.
While these results show that AL strategies perform better, the data sets of AIMED and CDR (Fig 2) are apparently more difficult to use with AL strategies: As can be observed the average AL strategy result is often close to random and the AL strategy variation shows that both good and bad outcomes, relative to the Random baseline, are obtained.Exploring the reasons for this problem provided no conclusive answers.
Visual inspection of the performance curves of AL processes reveals an inconsistency in the performances for those datasets across the iterations, i.e. random drops in the performance, more particularly in the case of recall and F1-score (S2 File).

Uncertainty-based methods are preferred for AL
To evaluate further the performance of the different AL strategies, the AULC for each pair of a strategy and a data set was computed (see Methods Section Analysis of the area under the learning curve, Table 3).
After the visual inspection of the relative difference with the Random baseline (Fig 2 ), uncertainty-based methods, i.e.Entropy, Least-Confident and Margin Sampling strategies, are expected to perform better.It is observed that they are generally the methods with the highest AULC, with the exception on the AIMED data set, where the Core-set strategy performs better.The Random baseline has a better AULC than BatchBALD for the AIMED, BioRED, CDR and DDI data sets, a better AULC than Entropy Sampling for the BioRED and CDR data sets, and a better AULC than Core-set for the BioRED data set.The Least-Confident and Margin Sampling strategies have an overall better AULC than Random Sampling.The average rank for each strategy is computed by averaging the ranking obtained by the strategy across each data set, the lowest being the best ranking (see Methods Section Analysis of the area under the learning curve).The average rank of the strategies as shown in Table 3 reveals that all AL strategies, except for BatchBALD, have a better performance than the Random baseline.Interestingly, Random Sampling has a generally lower standard deviation than the AL strategies, which may be due to the fact that it uniformly chooses samples.When comparing the AL strategies with Random Sampling as a control, only Least-Confident and Margin Sampling strategies show a significant statistical difference (p < 0.05).However, when performing pairwise comparisons between the AL strategies, all the strategies except Entropy Sampling (S1 Fig) outperform BatchBALD with significant statistical differences.
In Table 4, the size of the labelled data set needed to reach the average performance that can be reached with a fully labelled data set was computed for each sampling strategy.A decrease of annotation between 6% to 38% is noted, with AIMED and CDR having the worst reduction.The Least-Confident and Margin Sampling strategies obtained an optimal performance with only a subset of the full data set, confirming the previous observations for the uncertaintybased methods.

Different strategies optimise different metrics
To obtain a more precise insight into the performance of the different AL strategies, the intermediate results and the corresponding performance metrics are analysed at each iteration of the AL process (see Methods Section Intermediate results analysis).With the exception of BatchBALD, all the AL strategies statistically perform better than Random Sampling.Moreover, all AL strategies also outperform BatchBALD.No statistically significant difference is distinguished between the other AL strategies.
Interestingly, the analysis of the intermediate results with the accuracy as the performance measure (S1 Table ) confirms the results obtained using the F1-score as performance metric, with the addition of Entropy and Least-Confident Sampling also outperforming the Core-set strategy.
The intermediate results based on the precision in Table 6 display a clear superiority of the uncertainty-based methods over the other types of methods.This underlines how accurate the model is when it is predicting the positive class.Margin Sampling, the method which had generally the best AULCs based on its F1-score, is once again statistically dominated by both the Entropy and the Least-Confident Sampling methods.While the Core-set strategy is still better than the BatchBALD strategy, it does not statistically outperform the Random Sampling in terms of precision.
Finally, the intermediate results based on the recall in Table 7 display a remarkable dominance of the Core-set method over all the strategies.This shows that a great portion of the positive instances are predicted as positive.BatchBALD is outperformed by all the other AL strategies, except for Random Sampling.Entropy Sampling is the only other method next to the Core-set strategy that also performs better than Random Sampling in terms of recall.In summary, all AL strategies except BatchBALD perform significantly better than Random Sampling based on the F1-score, as observed through the relative difference with Random (Fig 2 ) and the AULCs (Table 3).Core-set outperforms all the strategies when looking at the recall, whereas uncertainty-based strategies outperform the other AL strategies regarding in relation to precision.This means that one could choose among the AL strategies according to the metric that one prefers to optimise.

AL strategies increase the fraction of positive instances in the training set
As both balanced and imbalanced data sets are used (see Table 1), an investigation into how the different AL strategies across the iterations handle the class distribution was performed (see Fig 3).We observed that Random Sampling and BatchBALD keep the initial balance of the data set throughout the AL process, whereas the other AL strategies improve the balance of the data sets, especially in the earlier AL iterations.
Concerning the balance measure results for the unbalanced data sets (Fig 3A and S3 File), all AL strategies, except for BatchBALD, apparently over-sample the minority class at the

Run-time analysis
We measured the time of execution for each AL strategy for the first two iterations for each data set, performing 5-fold cross-validation experiments, and reported the average time per sample selected from the unlabelled set for each iteration (see Table 8).
Uncertainty-based strategies (Entropy, Least-Confident and Margin) are the most time-efficient, as they only require to infer the unlabelled set of samples, whose size decreases throughout the AL process.BatchBALD runs inference 10 times on the unlabelled set, then is followed by a greedy approximation algorithm to create the batch.It means that it should also take less time to select samples across AL iterations, but it can initially take a lot of time to obtain the selected samples.Contrary to the previous strategies, Core-set needs to produce an embedding of all the samples at each iteration, regardless of the size of the labelled and unlabelled set.This operation is followed by a greedy k-center selection, so the computation time should not vary much across the AL iterations.

Discussion
In this paper, it has been shown that AL strategies generally perform better than choosing randomly the instances to label in an iterative fashion in bioRE.
The visual inspection of the F1-score curves of the AL processes revealed some inconsistencies in the performances of the AL processes.This result aligns with earlier deep active learning classification experiments with BERT-based models [36,71].
Nonetheless, the relative difference of the F1-score between the AL strategies and the Random baseline (Fig 2) clearly shows that AL strategies, except for BatchBALD, outperform the Random Sampling.The uncertainty-based strategies appear to be very successful in this respect.The difference between the AL strategies and the Random baseline obviously decreases with the ratio of the data set used for training, as the difference between the subsets selected by an AL strategy and the Random one decreases the closer the training set is to the full data set.
Additionally, the uncertainty-based methods achieve an optimal performance faster than other selection strategies, an observation shared with experiments on text classification with deep AL methods [34,36,50] and confirmed with the analysis of the AULCs, especially for the Least-Confident and Margin Sampling strategies (Table 3).
In an AL setting, one can decide to stop either when the trained model reaches a desired performance or when a fixed number of samples have been labelled, yet this threshold may depend on the data set itself: As one can observe in Fig 2 and Table 4, it is actually difficult to find a common threshold for all the data sets.The aim of AL in that case is to obtain an optimal performance with only a subset of the unlabelled data set.Moreover annotating unlabelled instances may be costly, as for instance in bioRE, requiring limits on the amount of labelling to be imposed.One could also detect that a sufficient amount of labelling has been done by stopping when the performance improvement is lower than some minimal threshold.
The study of the intermediate results (Tables 5-7) refine our comparison between the strategies.Uncertainty-based methods optimise the accuracy and precision metrics, while Core-set, the diversity-based method, aims for the optimisation of the recall metric.According to our results, there is room for improvement towards selection methods that optimise precision and recall simultaneously.Generally, one would choose precision as the most relevant metric when one wants to be sure that what is predicted as positive is actually positive and recall when one wants to be sure to capture as many positive instances as possible.In bioRE, one could wish for example for better precision in the case of protein-protein interaction to ensure that one has access to accurate information, whereas one would choose recall for drug-drug interaction as this type of information could be essential for treatment and to avoid adverse drug reaction, so increasing the number of false positives would not be an hindrance in such case.While one could think that a combination of both uncertainty and diversity strategies could lead to better performance based on these observations, BatchBALD, a hybrid method of both these types of strategies, performed overall even worse than Random Sampling.BatchBALD was however observed in their initial paper to not perform well for unbalanced data sets and to decrease in performance with the active batch size, which explains the bad performances in our experiments [56].Another possibility is to explore methods that actually combine an uncertainty-based method with an approach such as Core-set, as was proposed in the filtered active submodular selection (FASS) method [72], or to create a new method exclusively for bioRE.In the latter case, the results of the current study may serve as a baseline for benchmarking this new method.
Additionally, the analysis of how the different AL strategies select the instances according to the class imbalance uncovered that uncertainty-based strategies had the tendency to sample more positive instances, especially in the earlier iterations of the AL process, improving the balance of the training set as a side effect.This leads to a better generalisation of the model and allows it to reach an earlier optimal performance in comparison to the other strategies.The effect of balancing the data set during the training phase of the AL iteration was not investigated, however it could have a positive impact on the performance of the model, at the cost of the size of the training data as we would need to undersample the majority class.
Finally, the run-time analysis of the different AL methods highlights the cost of using databased and hybrid methods, which we show here are more computationally expensive, versus uncertainty-based methods, which are more time efficient.Our results argue in favour of the use of uncertainty-based AL strategies, especially as one should be concerned with both the economical and environmental costs of employing deep learning [73].
Our work focused on the empirical study for bioRE in a binary classification setting.A natural extension for this work would be the study of multi-class RE, as the identification of the relation type between entities is an important matter in bioRE [74].While a wide variety of deep learning models were studied for bioRE, we explored in this work only BERT-based language models.An additional inquiry would be to explore other deep learning models, such as GNNs or generative large language models, which have proven to perform well on the RE task [75], to confirm our results for other deep learning architectures.Another direction for future works is the use of a self-supervision prior combined with AL, which has been shown to increase the performance of the AL process, as it uses the unlabelled data to obtain the underlying representation of the data set [76], or combined with the AL process in a semi-supervised setting [77].Moreover, an extensive study on the mitigation of the class imbalance during the training or by using AL strategy aiming to balance the dataset [69,70] are an additional path to explore as class imbalance is a major concern in bioRE.More investigation on these topics are necessary before being able to use AL for bioRE in the most efficient way.

Conclusion
Building high-quality bioRE data sets requires both time and expertise in order to develop accurate RE tools to help with the biocuration of the increasing number of biomedical articles.AL methods aim to reduce the need of labelling by selecting the most informative instances to label among the unlabelled data.Knowing which AL methods perform the best with a specific type of data set allows for the development of annotation tools for bioRE using AL to efficiently select instances to be labelled by annotators, such as Paladin [78], AlpacaTag [79] or Label Sleuth [80] for text classification.
To the best of our knowledge, this work provides a first systematic study of bioRE with deep AL methods.With our experiments, it was shown that uncertainty-based methods, especially the Least-Confident strategy, are the methods with overall the best performance, as was also observed in the case of text classification.However, if one aims to optimize recall in particular, the use of the Core-set selection strategy is advised.

Fig 1 .
Fig 1. Diagram representing an AL scenario.The data set can be split into the test set, used to evaluate the performance of the machine learning model after each AL iteration, the labelled set and the unlabelled set.An active learning iteration is divided into 3 steps: training, selection and labelling.First, the labelled set is used to train a model, which is then used to select in the unlabelled set the most informative instances to label according to a strategy, such as the instances the model is the most uncertain about.Finally, those selected instances are labelled by an oracle (generally a human expert) and added to the labelled set.The AL loop can be stopped once a stop criterion is reached, such as a specific performance of the trained model on a test set, a number of AL iterations or a specific amount of instances that are to be labelled.https://doi.org/10.1371/journal.pone.0292356.g001 7 data sets × 5 folds × 3 seeds × (1 base model + (6 strategies × 10 iterations))).The experiments were conducted on a server with Ubuntu Desktop 20.04.5 LTS (GNU/Linux 5.15.0-56-generic x86_64) operating system, Nvidia driver 470.161.03,CUDA version 11.4, with 32GB RAM on 2 Asus GTX 1080 TI GPUs.

Fig 2 .
Fig 2. Distribution of the relative difference between the AL strategies and the Random baseline across the AL iterations.Y-values bigger than 0 indicate that the selection technique is performing better than the Random baseline.Except for BatchBALD, all AL strategies tend to have a positive difference compared to random This difference decreases with the increase of the size of the data set used for training.Results outside of 1.5*inter-quartile range from the first quartile and third quartile are removed for clarity.Boxplots containing the outliers are available in the S1 File.https://doi.org/10.1371/journal.pone.0292356.g002

Fig 3 .
Fig 3. Examples of balance measures using Shannon Entropy (see Eq 7) across AL iterations.Results for (A) an unbalanced data set, CDR, and (B) a balanced data set, Nary-DGV.The measure provides an average over each iteration.The same behaviours were observed for the other data sets of the same distribution (S3 File).For each data set, each panel highlights the results for the specific AL strategy and shows the others in grey.https://doi.org/10.1371/journal.pone.0292356.g003

Fig 4 .
Fig 4. Fraction of positive instances in the training set for the unbalanced data sets.Data sets are as follows, (A) AIMED, (B) BioRED, (C) CDR, (D) ChemProt and (E) DDI.The measures are averaged over each iteration.For each data set, each panel highlights the results for a strategy and greys out the others.https://doi.org/10.1371/journal.pone.0292356.g004

Fig 5 .
Fig 5. Fraction of positive instances in the training set for the balanced data sets.Data sets are as follows, (A) Nary-DGV and (B) Nary-DV.The measures are averaged over each iteration.For each data set, each panel highlights the results for a strategy and greys out the others.https://doi.org/10.1371/journal.pone.0292356.g005

1
Active learning analysis framework

Table 4 . Percentage of data set needed to reach at least the average performance with the total data set for each AL strategy.
Best percentage of annotated data set per data set are in bold.BB = BatchBALD, CS = Core-set, E = Entropy, LC = Least-Confident, M = Margin https://doi.org/10.1371/journal.pone.0292356.t004

Table 3 . Average AULC (as in Eq 5) and standard deviation over the folds for each pair of data set and selection strategy based on F1-score.
Best AULC per data set are in bold.Average rank corresponds to the average over the column of the ranking of the strategies for each dataset.The lower the rank, the better.P-value is the result of the Friedman statistical test with a Hommel post-hoc procedure compared with the Random Sampling baseline.BB = BatchBALD, CS = Core-set, E = Entropy, LC = Least-Confident, M = Margin, R = Random. https://doi.org/10.1371/journal.pone.0292356.t003

Table 5
lists the p-values for the Page-Trend statistical tests with the F1-score as the performance measure between each pair of AL strategies.Page-Trend tests if the ranking of the cutpoint scores of two strategies is positively increasing with the AL iterations, while the null hypothesis assumes a random order (see Methods Section Intermediate results analysis).

Table 5 . P-values for the Page trend statistical test with intermediate results based on F1-score.
Dominant strategies are in the row labels, i.e. the hypothesis tested is the strategy in the row performs better than the strategy in the column.P-values below 0.05 are in bold.BB = BatchBALD, CS = Core-set, E = Entropy, LC = Least-Confident, M = Margin, R = Random.https://doi.org/10.1371/journal.pone.0292356.t005

Table 6 . P-values for the Page trend statistical test with intermediate results based on precision.
Dominant strategies are in the row labels, i.e. the hypothesis tested is the strategy in the row performs better than the strategy in the column.P-values below 0.05 are in bold.BB = BatchBALD, CS = Core-set, E = Entropy, LC = Least-Confident, M = Margin, R = Random.https://doi.org/10.1371/journal.pone.0292356.t006

Table 7 . P-values for the Page trend statistical test with intermediate results based on recall.
Dominant strategies are in the row labels, i.e. the hypothesis tested is the strategy in the row performs better than the strategy in the column.P-values below 0.05 are in bold.BB = BatchBALD, CS = Core-set, E = Entropy, LC = Least-Confident, M = Margin, R = Random.

Table 8 . Execution time average over all the data sets per unlabelled sample for the two first iteration of the AL process for each AL strategy with their standard deviation.
BB = BatchBALD, CS = Core-set, E = Entropy, LC = Least-Confident, M = Margin, R = Random.Results were rounded at the first decimal.https://doi.org/10.1371/journal.pone.0292356.t008